December 21, 2018

December 21, 2018

Citi Bike and the rebalancing problem

NYC’s Citi Bike is the largest bike share system in the country with over 800 stations. An inherent challenge in operating a bikeshare system is the uneven distribution of demand and the resultant difficulty in ensuring the availability of bikes or docks for users where and when they need them. To cope with this, bikeshare operators manually redistribute or rebalance bikes based on spatial and temporal demand trends.

This map shows the extent of Citi Bike stations, which spans three boroughs and also includes Jersey City. For the purposes of this study, we focus on 736 docks in New York only.

ggplot(all_stations) +
  geom_sf(data = basemap_sf, fill = "black", color = NA) +
  geom_point(aes(x = lon, y = lat, color = Region), size = 1) +
  coord_sf(xlim = c(min(all_stations$lon),max(all_stations$lon)),
   ylim = c(min(all_stations$lat),max(all_stations$lat))) +
  labs(title = "Citi Bike stations") +
  mapTheme() + 
  guides(colour = guide_legend(override.aes = list(size=3)))

for (i in 0:23) {
  gridExtra::grid.arrange(
ggplot() +
geom_sf(data = basemap_sf, fill = "black", color = NA) +
geom_point(data = plot_time %>% 
 filter(hour == i, wknd == 0) %>%
 arrange(dest - orig),
   aes(X, Y,
   color = dest - orig,
   size  = abs(dest - orig))) +
coord_sf(xlim = c(min(nyc_stations$lon),max(nyc_stations$lon)),
 ylim = c(min(nyc_stations$lat),max(nyc_stations$lat))) +
mapTheme() +
scale_size(range = c(0.5, 3), guide = "none") +
scale_color_gradientn(colors = c("limegreen",
 "limegreen","gray20",
 "magenta","magenta"),
  limits = c(-60,60)) +
labs(title = paste0("Existing need for rebalancing \n(hour: ",i,")"),
 color = "Excess bikes")
  )
}



The need for rebalancing comes from the spatial and temporal asymmetry in supply and demand for bikes. In the am peak, there tends to be an excess of bikes in job centers like the Financial District or Midtown, meaning that people are traveling to these areas from their residences in other parts of the city. The trend flips in the PM peak as people return home to areas like the Upper West Side and Lower East Side.

Rebalancing can be costly, requiring special vehicles, labor, and time, and can be made even more costly by inefficient routing especially during peak periods with high traffic. With such a huge system in such a large city, rebalancing becomes a formidable task, making machine learning algorithms that can predict demand and supply and allow more efficient rebalancing activity all the more valuable. Using predictive modeling and machine learning can help facilitate the efficient redistribution of bikes where and when they are needed, meeting customer demand and helping boost ridership while ensuring efficient use of resources.

A new and improved app

To facilitate the rebalancing of bikes in the Citi Bike system, we are proposing a new and improved Citi Bike app. The app will have functionality both for operators, who will be redistributing multiple bikes at a time in special vehicles, and for users who can redistribute bikes in exchange for points and perks.

app

Operators

The app, using predicted demand and supply for each station, will inform operators of the number of excess bikes or docks at any station in real time. Stations will be assigned priority values, based on the model-derived immediacy of their need for rebalancing. The app will also have a routing capability, creating efficient routes for operators to follow to pick up and drop off bikes, while taking into account each station’s rebalancing priority level.

Customers

The customer interface will provide similar information to the operator version of the app, but in a gamified manner. Rather than displaying the number of bikes needed to be rebalanced at each station, the app will show customers point values derived from the rebalancing priority level of each station. When members rebalance bikes from or to these stations, the corresponding point value will be added to their account, as part of Citi Bike’s Bike Angel program. These points will then be redeemable for various rewards, including extended memberships. This version of the app will also have a routing capability, but one that is based on an origin and destination inputted by the user. When displaying routing options, however, the app will provide an option for the most direct route, as well as an option for a route that prioritizes stations that need rebalancing.


Data exploration

To find out if any particular station needs rebalancing, we create two models predicting hourly demand for and supply of bikes to each station. The amount of bikes needed to be rebalanced at any given station at any given hour is the difference between the two values. If demand exceeds supply, bikes need to be added, and vice versa.

Predicting demand and supply

To elaborate, the demand of bikes at a Citi Bike station is equal to the number of trips that originated from that station during that hour. The supply of bikes is the number of trips whose destination is that station. These values can be aggregated from publicly available Citi Bike data (citibikenyc.com/system-data). The original data from May 2018 included 1.8 million trips.

gridExtra::grid.arrange(
  qplot(data = plot_data, x = orig, geom = "histogram") +
labs(title = "Citi Bike trip volumes", 
 subtitle = "Week of May 13 to May 19, 2018",
 x = "Hourly departures\n(demand for bikes)") +
plotTheme() +
theme(axis.text.y = element_blank()) +
coord_cartesian(xlim = c(0,50), ylim = c(4500,100000)),
  qplot(data = plot_data, x = orig, geom = "histogram") +
labs(title = "", subtitle = "",
 x = "Hourly arrivals\n(supply of bikes)") +
plotTheme() +
theme(axis.text.y = element_blank()) +
coord_cartesian(xlim = c(0,50), ylim = c(4500,100000)),
  nrow = 1)

Demand for and supply of bikes varies both spatially and temporally. On weekdays, the number of Citi Bike trips spikes at the start and end of the work day. On weekends, the number of trips gradually rises, then falls. We also expect that weather conditions affect how much people want to ride, as historical weather data shows rain on Tuesday, Wednesday, and Thursday, the days with less trips in the plot above.

ggplot(plot_hour) +
  geom_line(aes(x = hour, y = trip, 
color = wday,
linetype = wday)) +
  plotTheme() + 
  coord_cartesian(xlim = c(1,23),ylim = c(350,7650)) +
  labs(title = "Systemwide Citi Bike trips",
   subtitle = "Week of May 13 to May 19, 2018",
   y = element_blank(), x = "Time (hour)",
   color = "Day", linetype = "Day") +
  scale_linetype_manual(values = c("dashed","solid","solid",
   "solid","solid","solid",
   "dashed")) +
  scale_color_manual(values = c("black",
"red","orange","chartreuse","turquoise","blue",
"gray70"))

Auto-correlation

Certain areas like Midtown Manhattan experience a high supply of bikes during the AM peak, and a high demand for them in the PM peak. This reflects commuting behaviors as residents go to work in the morning and return at night.

demand_time <- function(TIME,WKND) {
  ggplot() +
geom_sf(data = basemap_sf, fill = "black", color = NA) +
geom_point(data = plot_time %>% 
 filter(hour == TIME, wknd == WKND) %>%
 arrange(orig),
   aes(X, Y,
   color = orig,
   size  = orig)) + 
scale_size(range = c(0.5, 3)) +
coord_sf(xlim = c(min(nyc_stations$lon),max(nyc_stations$lon)),
 ylim = c(min(nyc_stations$lat),max(nyc_stations$lat))) +
mapTheme() +
scale_color_gradientn(colors = c("gray20",
 "limegreen","limegreen")) +
theme(legend.position="none") +
labs(subtitle = "Departures")
}
supply_time <- function(TIME,WKND) {
  ggplot() +
geom_sf(data = basemap_sf, fill = "black", color = NA) +
geom_point(data = plot_time %>% 
 filter(hour == TIME, wknd == WKND) %>%
 arrange(dest),
   aes(X, Y,
   color = dest,
   size  = dest)) + 
scale_size(range = c(0.5, 3)) +
coord_sf(xlim = c(min(nyc_stations$lon),max(nyc_stations$lon)),
 ylim = c(min(nyc_stations$lat),max(nyc_stations$lat))) +
mapTheme() +
scale_color_gradientn(colors = c("gray20",
 "magenta","magenta")) +
theme(legend.position="none") +
labs(subtitle = "Arrivals")
}

for (i in 0:23) {
  gridExtra::grid.arrange(
demand_time(i,0) + labs(title = paste0("Weekly Citi Bike activity (hour: ",i,")")),
supply_time(i,0) + labs(title = ""),
nrow = 1
  )
}

Moran’s I Test p-values
(smaller is more significant)
Departures Arrivals
Nearest 2 0.000024 0.0000051
Nearest 5 0.006110 0.0059500
Nearest 10 0.020000 0.0930000


Each station is likely to follow the same trends every week, as commuters are likely to take the same station. Furthermore, riders also tend to choose between a variety of stations, especially when they’re clustered together. A significant p-value from the Moran’s I test indicates that nearby stations are, indeed, autocorrelated. To account for this autocorrelation, we introduce some spatial and time lag variables as well.

Summary of Variables

Category Variable Description
Dependent variables orig Trip origins (demand for bikes / departures from each station at each hour)
dest Trip destinations (supply of bikes / arrivals from each station at each hour)
Station characteristics capacity Number of docks
GEOID Census tract fixed effects should capture spatial and geographic features including land use, sociodemographic characteristics, residence and employment distribution, etc.
ZIPCODE ZIP code fixed effects
dist_nn2 Average distance to nearest 2 neighbors
dist_nn5 Average distance to nearest 5 neighbors
dist_nn10 Average distance to nearest 10 neighbors
Time wday Day of the week (Sun, Mon, etc)
wknd Trip occurred on a weekend (1 = yes, 0 = no)
hour Hour of day (0-23)
time Time of day (night = 0-5, morning = 6-9, midday = 10-15, afternoon = 16-19, evening = 20-23)
Weather temp Wet bulb temperature in degrees F (noaa.gov)
wind Wind speed in miles per hour (noaa.gov)
humid Relative humidity in percent (noaa.gov)
precip Precipitation in inches (noaa.gov)
Temporal lag *_lh Origins or destinations in the last hour
*_l3hr Origins or destinations in the last three hours
*_lw Origins or destinations this hour, one week ago
*_lw_3hr Origins or destinations in the previous three hours, one week ago
*_day Origins or destinations for this day, one week ago
*_time Origins or destinations for this time, one week ago
Spatial lag *_nn2 All temporal lag variables for the average of the nearest 2 neighbors
*_nn5 All temporal lag variables for the average of the nearest 2 neighbors
*_nn10 All temporal lag variables for the average of the nearest 2 neighbors
cap_nn2 Average capacity of nearest 2 neighbors
cap_nn5 Average capacity of nearest 5 neighbors
cap_nn10 Average capacity of nearest 10 neighbors

Results

Because demand for and supply of bikes each hour are counts with many zero values, we build our models using a Poisson regression. Poisson regressions assume that the dependent variables (in this case, demand and supply) have a Poisson distribution and can be predicted with some linear combination of contributing factors.

Using a forward stepwise procedure, we incrementally add and select the most significant contributing factors that give accurate and generalizable models. For our demand model, there are 39 independent variables; for the supply model, there are 33. Interestingly, this is a large number of variables that proved signficant; it’s likely than more feature engineering will prove helpful.

Demand model (predicting departures)

summary(mod_dep10$finalModel)
## 
## Call:
## NULL
## 
## Deviance Residuals: 
##      Min        1Q    Median        3Q       Max  
## -21.4958   -1.3208   -0.6108    0.3840   17.3369  
## 
## Coefficients:
##                            Estimate   Std. Error  z value
## (Intercept)              -0.8453639    0.0384561  -21.983
## dist_nn2               -130.9854350    6.1392000  -21.336
## dist_nn5                131.9500930   11.1575912   11.826
## dist_nn10               -47.3410421    7.2269898   -6.551
## time.L                    0.9998751    0.0085045  117.570
## time.Q                   -1.4100442    0.0083496 -168.875
## time.C                    0.4460193    0.0054882   81.268
## `time^4`                 -0.4969285    0.0051327  -96.816
## wday.L                   -0.1151926    0.0065260  -17.651
## wday.Q                   -0.1425098    0.0095538  -14.917
## wday.C                   -0.1218532    0.0057609  -21.152
## `wday^4`                 -0.3349198    0.0060525  -55.336
## `wday^5`                 -0.0361749    0.0061012   -5.929
## `wday^6`                  0.0774497    0.0058876   13.155
## capacity                  0.0105658    0.0002123   49.777
## dest_lh                   0.0012698    0.0003861    3.289
## orig_lh                   0.0052830    0.0003255   16.229
## orig_lw                   0.0135933    0.0001545   87.974
## dest_time                -0.0151999    0.0003450  -44.056
## orig_day                  0.0478420    0.0024620   19.432
## dest_day                  0.0399845    0.0024445   16.357
## first_nn2                -0.2406566    0.0074941  -32.113
## excess_net2              -0.0035671    0.0003789   -9.414
## orig_nn2                 -0.0034958    0.0007696   -4.542
## dest_lw_nn2               0.0027711    0.0006247    4.436
## orig_day_nn2              0.0104140    0.0011826    8.806
## cap_nn5                  -0.0251648    0.0006823  -36.882
## first_nn5                 0.0174613    0.0075823    2.303
## excess_net5               0.0023500    0.0002620    8.968
## orig_nn5                  0.0142640    0.0013850   10.299
## orig_lh_nn5              -0.0031417    0.0010839   -2.898
## dest_time_nn5            -0.0045999    0.0011388   -4.039
## cap_nn10                  0.0433325    0.0008807   49.202
## first_nn10                0.0604756    0.0055394   10.917
## excess_net10              0.0050084    0.0001676   29.874
## orig_nn10                 0.0844361    0.0015702   53.773
## orig_lh_nn10              0.0219792    0.0015074   14.581
## dest_lh_nn10             -0.0204355    0.0010888  -18.768
## dest_lw_nn10             -0.0214091    0.0010711  -19.988
## orig_time_nn10           -0.0188923    0.0012985  -14.549
## orig_day_nn10             0.0417717    0.0098312    4.249
## dest_day_nn10            -0.0488454    0.0094126   -5.189
## HOURLYWETBULBTEMPF        0.0057227    0.0005121   11.175
## HOURLYWindSpeed          -0.0201322    0.0009108  -22.103
## HOURLYRelativeHumidity   -0.0035339    0.0001894  -18.654
## precip                    1.1088035    0.0936744   11.837
## dest_l3hr                 0.0079671    0.0005163   15.431
## orig_l3hr                 0.0011425    0.0005072    2.253
##                                    Pr(>|z|)    
## (Intercept)            < 0.0000000000000002 ***
## dist_nn2               < 0.0000000000000002 ***
## dist_nn5               < 0.0000000000000002 ***
## dist_nn10                   0.0000000000573 ***
## time.L                 < 0.0000000000000002 ***
## time.Q                 < 0.0000000000000002 ***
## time.C                 < 0.0000000000000002 ***
## `time^4`               < 0.0000000000000002 ***
## wday.L                 < 0.0000000000000002 ***
## wday.Q                 < 0.0000000000000002 ***
## wday.C                 < 0.0000000000000002 ***
## `wday^4`               < 0.0000000000000002 ***
## `wday^5`                    0.0000000030458 ***
## `wday^6`               < 0.0000000000000002 ***
## capacity               < 0.0000000000000002 ***
## dest_lh                             0.00101 ** 
## orig_lh                < 0.0000000000000002 ***
## orig_lw                < 0.0000000000000002 ***
## dest_time              < 0.0000000000000002 ***
## orig_day               < 0.0000000000000002 ***
## dest_day               < 0.0000000000000002 ***
## first_nn2              < 0.0000000000000002 ***
## excess_net2            < 0.0000000000000002 ***
## orig_nn2                    0.0000055614405 ***
## dest_lw_nn2                 0.0000091581861 ***
## orig_day_nn2           < 0.0000000000000002 ***
## cap_nn5                < 0.0000000000000002 ***
## first_nn5                           0.02128 *  
## excess_net5            < 0.0000000000000002 ***
## orig_nn5               < 0.0000000000000002 ***
## orig_lh_nn5                         0.00375 ** 
## dest_time_nn5               0.0000536610023 ***
## cap_nn10               < 0.0000000000000002 ***
## first_nn10             < 0.0000000000000002 ***
## excess_net10           < 0.0000000000000002 ***
## orig_nn10              < 0.0000000000000002 ***
## orig_lh_nn10           < 0.0000000000000002 ***
## dest_lh_nn10           < 0.0000000000000002 ***
## dest_lw_nn10           < 0.0000000000000002 ***
## orig_time_nn10         < 0.0000000000000002 ***
## orig_day_nn10               0.0000214847612 ***
## dest_day_nn10               0.0000002110048 ***
## HOURLYWETBULBTEMPF     < 0.0000000000000002 ***
## HOURLYWindSpeed        < 0.0000000000000002 ***
## HOURLYRelativeHumidity < 0.0000000000000002 ***
## precip                 < 0.0000000000000002 ***
## dest_l3hr              < 0.0000000000000002 ***
## orig_l3hr                           0.02429 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for poisson family taken to be 1)
## 
##     Null deviance: 687401  on 124383  degrees of freedom
## Residual deviance: 263138  on 124336  degrees of freedom
## AIC: 456626
## 
## Number of Fisher Scoring iterations: 6

Supply model (predicting arrivals)

summary(mod_arr10$finalModel)
## 
## Call:
## NULL
## 
## Deviance Residuals: 
##      Min        1Q    Median        3Q       Max  
## -25.8662   -1.3366   -0.6208    0.4024   11.0435  
## 
## Coefficients:
##                            Estimate   Std. Error  z value
## (Intercept)              -0.9427152    0.0386508  -24.391
## dist_nn2               -158.7799439    6.1280266  -25.910
## dist_nn5                177.4586871   11.1610849   15.900
## dist_nn10               -65.2706028    7.2474390   -9.006
## time.L                    1.0456662    0.0082816  126.265
## time.Q                   -1.3079668    0.0081156 -161.168
## time.C                    0.3854922    0.0054901   70.216
## `time^4`                 -0.4434657    0.0050527  -87.767
## wday.L                   -0.1164255    0.0064981  -17.917
## wday.Q                   -0.0572144    0.0095759   -5.975
## wday.C                   -0.1189027    0.0057780  -20.579
## `wday^4`                 -0.3256106    0.0060630  -53.705
## `wday^5`                  0.0093756    0.0061112    1.534
## `wday^6`                  0.0775754    0.0059298   13.082
## capacity                  0.0099184    0.0002122   46.748
## dest_lh                   0.0037569    0.0002421   15.518
## orig_lh                   0.0031754    0.0003511    9.044
## dest_lw                   0.0144099    0.0002437   59.140
## orig_lw                  -0.0019523    0.0002807   -6.956
## orig_time                -0.0200494    0.0005018  -39.958
## dest_time                 0.0141580    0.0004610   30.712
## orig_day                  0.0643461    0.0023072   27.889
## dest_day                  0.0182646    0.0023065    7.919
## first_nn2                -0.2482067    0.0074578  -33.281
## excess_net2              -0.0042153    0.0004157  -10.140
## orig_nn2                 -0.0097636    0.0008516  -11.465
## orig_lw_nn2               0.0034259    0.0008103    4.228
## dest_lw_nn2               0.0064262    0.0007490    8.580
## orig_time_nn2            -0.0085416    0.0011562   -7.388
## orig_day_nn2              0.0192243    0.0016494   11.655
## cap_nn5                  -0.0222072    0.0007074  -31.393
## first_nn5                 0.0321000    0.0075764    4.237
## excess_net5               0.0070572    0.0002575   27.407
## orig_nn5                  0.0232214    0.0016445   14.121
## orig_lh_nn5              -0.0044842    0.0011685   -3.838
## orig_lw_nn5               0.0045396    0.0016172    2.807
## dest_lw_nn5              -0.0131555    0.0013655   -9.634
## orig_time_nn5             0.0177190    0.0023314    7.600
## orig_day_nn5             -0.0247999    0.0033243   -7.460
## cap_nn10                  0.0403879    0.0008993   44.911
## first_nn10                0.0682479    0.0055322   12.336
## orig_nn10                 0.0587683    0.0014771   39.787
## orig_lh_nn10              0.0240463    0.0013199   18.219
## orig_lw_nn10             -0.0124289    0.0016537   -7.516
## dest_lw_nn10              0.0094134    0.0012289    7.660
## orig_time_nn10           -0.0514587    0.0023060  -22.315
## orig_day_nn10             0.0142676    0.0032913    4.335
## HOURLYWETBULBTEMPF        0.0096334    0.0005148   18.712
## HOURLYWindSpeed          -0.0185605    0.0009051  -20.506
## HOURLYRelativeHumidity   -0.0049450    0.0001874  -26.385
## precip                    0.5608956    0.0949327    5.908
## orig_l3hr                 0.0036426    0.0004503    8.090
##                                    Pr(>|z|)    
## (Intercept)            < 0.0000000000000002 ***
## dist_nn2               < 0.0000000000000002 ***
## dist_nn5               < 0.0000000000000002 ***
## dist_nn10              < 0.0000000000000002 ***
## time.L                 < 0.0000000000000002 ***
## time.Q                 < 0.0000000000000002 ***
## time.C                 < 0.0000000000000002 ***
## `time^4`               < 0.0000000000000002 ***
## wday.L                 < 0.0000000000000002 ***
## wday.Q                 0.000000002303087518 ***
## wday.C                 < 0.0000000000000002 ***
## `wday^4`               < 0.0000000000000002 ***
## `wday^5`                           0.124988    
## `wday^6`               < 0.0000000000000002 ***
## capacity               < 0.0000000000000002 ***
## dest_lh                < 0.0000000000000002 ***
## orig_lh                < 0.0000000000000002 ***
## dest_lw                < 0.0000000000000002 ***
## orig_lw                0.000000000003503069 ***
## orig_time              < 0.0000000000000002 ***
## dest_time              < 0.0000000000000002 ***
## orig_day               < 0.0000000000000002 ***
## dest_day               0.000000000000002399 ***
## first_nn2              < 0.0000000000000002 ***
## excess_net2            < 0.0000000000000002 ***
## orig_nn2               < 0.0000000000000002 ***
## orig_lw_nn2            0.000023598317609794 ***
## dest_lw_nn2            < 0.0000000000000002 ***
## orig_time_nn2          0.000000000000149121 ***
## orig_day_nn2           < 0.0000000000000002 ***
## cap_nn5                < 0.0000000000000002 ***
## first_nn5              0.000022670227377094 ***
## excess_net5            < 0.0000000000000002 ***
## orig_nn5               < 0.0000000000000002 ***
## orig_lh_nn5                        0.000124 ***
## orig_lw_nn5                        0.005001 ** 
## dest_lw_nn5            < 0.0000000000000002 ***
## orig_time_nn5          0.000000000000029558 ***
## orig_day_nn5           0.000000000000086468 ***
## cap_nn10               < 0.0000000000000002 ***
## first_nn10             < 0.0000000000000002 ***
## orig_nn10              < 0.0000000000000002 ***
## orig_lh_nn10           < 0.0000000000000002 ***
## orig_lw_nn10           0.000000000000056568 ***
## dest_lw_nn10           0.000000000000018582 ***
## orig_time_nn10         < 0.0000000000000002 ***
## orig_day_nn10          0.000014581349814439 ***
## HOURLYWETBULBTEMPF     < 0.0000000000000002 ***
## HOURLYWindSpeed        < 0.0000000000000002 ***
## HOURLYRelativeHumidity < 0.0000000000000002 ***
## precip                 0.000000003455468658 ***
## orig_l3hr              0.000000000000000597 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for poisson family taken to be 1)
## 
##     Null deviance: 693982  on 124383  degrees of freedom
## Residual deviance: 259189  on 124332  degrees of freedom
## AIC: 452548
## 
## Number of Fisher Scoring iterations: 6

We use 100-fold cross-validation, an out-of-sample validation method that randomly separates the data into 100 portions.

To further test whether our models can predict out-of-sample, we use our models to predict demand for and supply of bikes at all stations on Friday, May 25 and Saturday, May 26 the following week. By subtracting demand from supply, we can determine the excess number of Citi Bikes at any given station at all hours of the day.

plot_result <- output %>%
  left_join(all_stations %>% select(station_id,lon,lat)) %>%
  mutate(exc  = dest - orig,
         exc  = ifelse(exc > 50, 50, exc),
         orig = ifelse(orig > 50, 50, orig),
         dest = ifelse(dest > 50, 50, dest))

demand_time <- function(DATE) {
  ggplot() +
geom_sf(data = basemap_sf, fill = "black", color = NA) +
geom_point(data = plot_result %>% 
 filter(date_id == DATE) %>%
 arrange(orig),
   aes(lon, lat,
   color = orig,
   size  = orig)) + 
scale_size(range = c(0.5, 3)) +
coord_sf(xlim = c(min(nyc_stations$lon),max(nyc_stations$lon)),
 ylim = c(min(nyc_stations$lat),max(nyc_stations$lat))) +
mapTheme() +
scale_color_gradientn(colors = c("gray20",
 "limegreen","limegreen")) +
theme(legend.position="none") +
labs(subtitle = "Departures")
}
supply_time <- function(DATE) {
  ggplot() +
geom_sf(data = basemap_sf, fill = "black", color = NA) +
geom_point(data = plot_result %>% 
 filter(date_id == DATE) %>%
 arrange(dest),
   aes(lon, lat,
   color = dest,
   size  = dest)) + 
scale_size(range = c(0.5, 3)) +
coord_sf(xlim = c(min(nyc_stations$lon),max(nyc_stations$lon)),
 ylim = c(min(nyc_stations$lat),max(nyc_stations$lat))) +
mapTheme() +
scale_color_gradientn(colors = c("gray20",
 "magenta","magenta")) +
theme(legend.position="none") +
labs(subtitle = "Arrivals")
}
excess_time <- function(DATE) {
  ggplot() +
geom_sf(data = basemap_sf, fill = "black", color = NA) +
geom_point(data = plot_result %>% 
 filter(date_id == DATE) %>%
 arrange(abs(exc)),
   aes(lon, lat,
   color = exc,
   size  = abs(exc))) + 
scale_size(range = c(0.5, 3)) +
coord_sf(xlim = c(min(nyc_stations$lon),max(nyc_stations$lon)),
 ylim = c(min(nyc_stations$lat),max(nyc_stations$lat))) +
mapTheme() +
scale_color_gradientn(colors = c("limegreen","limegreen","gray20",
 "magenta","magenta")) +
theme(legend.position="none") +
labs(subtitle = "Excess")
}

for (i in unique(output$date_id)) {
  
  date <- as.POSIXlt(i, origin="1970-01-01", tz="America/New_York") + dhours(4)
  date <- paste0(wday(date, label = TRUE)," ",hour(date),":00")
  
  gridExtra::grid.arrange(
    demand_time(i),
    supply_time(i),
    excess_time(i),
    nrow = 1,
    top = grid::textGrob(paste0("Predicted activity (",date,")"),
                         gp=grid::gpar(fontsize=14),
                         hjust = 1)
  )
}

Top 10 stations and times in need of rebalancing

Negative numbers mean a shortage of bikes.

output %>% 
  filter(date(date_id) == ymd("2018-05-25")) %>%
  mutate(hour = paste0(hour(date_id),":00")) %>%
  arrange(desc(abs(exc_pred))) %>%
  head(10) %>%
  arrange(hour(date_id)) %>%
  left_join(all_stations %>% select(station_id,name)) %>%
  select(Station = name,
         Time = hour, 
         `Excess bikes` = exc_pred) %>%
  kable(caption = "Friday, May 25", digits = 1) %>%
  kable_styling(bootstrap_options = c("hover"))
Friday, May 25
Station Time Excess bikes
W 33 St & 7 Ave 7:00 -164.2
W 33 St & 7 Ave 8:00 -85.0
W 52 St & 5 Ave 8:00 68.0
Broadway & E 22 St 9:00 71.5
W 33 St & 7 Ave 15:00 67.9
W 33 St & 7 Ave 16:00 607.2
Pershing Square North 17:00 -197.3
W 33 St & 7 Ave 17:00 195.7
W 31 St & 7 Ave 17:00 151.1
Pershing Square North 18:00 -115.6
output %>% 
  filter(date(date_id) == ymd("2018-05-26")) %>%
  mutate(hour = paste0(hour(date_id),":00")) %>%
  arrange(desc(abs(exc_pred))) %>%
  head(10) %>%
  arrange(hour(date_id)) %>%
  left_join(all_stations %>% select(station_id,name)) %>%
  select(Station = name,
         Time = hour,  
         `Excess bikes` = exc_pred) %>%
  kable(caption = "Saturday, May 26", digits = 1) %>%
  kable_styling(bootstrap_options = c("hover"))
Saturday, May 26
Station Time Excess bikes
E 15 St & 3 Ave 12:00 -3.1
Grand Army Plaza & Central Park S 16:00 -4.4
St Marks Pl & 1 Ave 16:00 3.8
Broadway & W 49 St 16:00 2.9
E 11 St & 1 Ave 16:00 2.5
E 15 St & 3 Ave 17:00 3.2
6 Ave & Canal St 17:00 2.5
St Marks Pl & 1 Ave 18:00 4.1
Grand Army Plaza & Central Park S 18:00 -3.5
St Marks Pl & 1 Ave 19:00 3.0

Validation

In the predictions for Friday above, we can immediately tell that the models need to be validated, because some stations report an exorbitant number upwards of 600 bikes that need to be rebalanced. Two areas of model validation are accuracy, or the magnitude of error, and generalizability, or the distribution of error across variables.

Model accuracy

General model accuracy can be measured using Mean Absolute Error (MAE), which a type of average that can measure the magnitude of error even when some error values are negative. A second type of error, Mean Bias Error (MBE) takes the average without adjusting for negative values. This type of error can reveal whether, on average, predicted values are higher (MBE > 0) or lower (MBE < 0) than observed values.

The table below indicates the error for our out-sample prediction. An MAE of about 2 means that on average, the models’ predictions are off by 2 bikes. On weekends, both models have a negative MBE, indicating that they are likely to under-predict the number of arrivals and departures.

Day Demand MAE Demand MBE Supply MAE Supply MBE
Fri 2.33 0.04 2.38 0.20
Sat 2.12 -1.17 2.10 -0.99
Both days 2.22 -0.57 2.24 -0.39

Other types of error include Mean Average Percent Error (MAPE) and Root Mean Square Error (RMSE). However, MAPE is unsuitable to measure deviation from zero and RMSE over-emphasizes high error values. The absolute magnitude of error (i.e. MAE) is most useful because it reflects, on average, the number of bikes per station each hour that our model mis-predicts.

Model generalizability

Model generalizability is a measure of error that shows not the magnitude but the variation in error at different times, places, or across some variable. First, we can see the distribution of error across the 100 random folds from our 100-fold regression. For most folds, the error is clustered near the MAE of the final models. Both the supply and demand models are right-tailed, indicating that some folds have a particularly high error. The next few sections look into exactly where and when our models are underperforming.

gridExtra::grid.arrange(
ggplot(mod_dep10$resample) +
  geom_histogram(aes(x = MAE)) + 
  plotTheme() + 
  coord_cartesian(xlim = c(1.5,2.5), ylim = c(1,25)) + 
  theme(axis.text.y = element_blank(),
        axis.title.y = element_blank()) +
  labs(title = "Distribution of error across 100 random folds",
       subtitle = "Demand (departures)"),
ggplot(mod_arr10$resample) +
  geom_histogram(aes(x = MAE)) + 
  plotTheme() + 
  coord_cartesian(xlim = c(1.5,2.5), ylim = c(1,25)) + 
  theme(axis.text.y = element_blank(),
        axis.title.y = element_blank()) +
  labs(title = "",
       subtitle = "Supply (arrivals)"),
nrow = 1)

Error across time

Next is a comparison of error over time. On weekdays, both models perform relatively well, typically sligntly under-predicting in the morning, then over-predicting during the PM peak. On weekends, however, the models systematically under predict the number of arrivals and departures at each station. This may indicate that some time-based variable needs to be added. Another possibility includes making separate models for weekdays and weekends, as it’s been illustrated that weekdays and weekends see very different activity patterns.

time_err <- output %>%
  group_by(date_id) %>%
  summarize(orig      = mean(orig, na.rm = TRUE),
            orig_pred = mean(orig_pred, na.rm = TRUE),
            orig_err  = mean(orig_pred - orig, na.rm = TRUE),
            orig_mae  = mean(abs(orig_pred - orig),na.rm = TRUE),
            dest      = mean(dest, na.rm = TRUE),
            dest_pred = mean(dest_pred, na.rm = TRUE),
            dest_err  = mean(dest_pred - dest, na.rm = TRUE),
            dest_mae  = mean(abs(dest_pred - dest),na.rm = TRUE)) %>%
  mutate(hour = hour(date_id))
gridExtra::grid.arrange(
ggplot(data = time_err %>%
         rbind(time_err %>% 
                 filter(hour == 0) %>%
                 mutate(hour = 24)) %>% 
         filter(wday(date_id) == 6)) + 
  geom_area(aes(x = hour, y = orig), fill = "gray", alpha = 0.25) +
  geom_line(aes(x = hour, y = orig), color = "gray",size = 1) + 
  geom_area(aes(x = hour, y = orig_pred), fill = "limegreen", alpha = 0.25) +
  geom_line(aes(x = hour, y = orig_pred),
            color = "limegreen", size = 1) + 
  geom_line(aes(x = hour, y = (orig_err)),
            color = "black", size = 1, linetype = "1111") +
  plotTheme() + 
  scale_x_continuous(breaks = c(0,6,12,18,24)) +
  coord_cartesian(xlim = c(1,23),ylim = c(-3,10)) +
  labs(title = "Friday predictions by hour",
       subtitle = "Demand (departures)",
       x = "Hour", y = "Bikes"),
ggplot(data = time_err %>%
         rbind(time_err %>% 
                 filter(hour == 0) %>%
                 mutate(hour = 24)) %>%
         filter(wday(date_id) == 6)) + 
  geom_area(aes(x = hour, y = dest), fill = "gray", alpha = 0.25) +
  geom_line(aes(x = hour, y = dest), color = "gray",size = 1) + 
  geom_area(aes(x = hour, y = dest_pred), fill = "magenta", alpha = 0.25) +
  geom_line(aes(x = hour, y = dest_pred),
            color = "magenta", size = 1) + 
  geom_line(aes(x = hour, y = (dest_err)),
            color = "black", size = 1, linetype = "1111") +
  plotTheme() + 
  scale_x_continuous(breaks = c(0,6,12,18,24)) +
  coord_cartesian(xlim = c(1,23),ylim = c(-3,10)) +
  labs(title = "(error in black)",
       subtitle = "Supply (arrivals)",
       x = "Hour", y = ""),
nrow = 1)

gridExtra::grid.arrange(
ggplot(data = time_err %>% 
         rbind(time_err %>% 
                 filter(hour == 0) %>%
                 mutate(hour = 24)) %>%
         filter(wday(date_id) == 7)) + 
  geom_area(aes(x = hour, y = orig), fill = "gray", alpha = 0.25) +
  geom_line(aes(x = hour, y = orig), color = "gray", size = 1) + 
  geom_area(aes(x = hour, y = orig_pred), fill = "limegreen", alpha = 0.25) +
  geom_line(aes(x = hour, y = orig_pred),
            color = "limegreen", size = 1) + 
  geom_line(aes(x = hour, y = (orig_err)),
            color = "black", size = 1, linetype = "1111") +
  plotTheme() + 
  scale_x_continuous(breaks = c(0,6,12,18,24)) +
  coord_cartesian(xlim = c(1,23),ylim = c(-3,10)) +
  labs(title = "Saturday predictions by hour",
       subtitle = "Demand (departures)",
       x = "Hour", y = "Bikes"),
ggplot(data = time_err %>% 
         rbind(time_err %>% 
                 filter(hour == 0) %>%
                 mutate(hour = 24)) %>%
         filter(wday(date_id) == 7)) + 
  geom_area(aes(x = hour, y = dest), fill = "gray", alpha = 0.25) +
  geom_line(aes(x = hour, y = dest), color = "gray",size = 1) + 
  geom_area(aes(x = hour, y = dest_pred), fill = "magenta", alpha = 0.25) +
  geom_line(aes(x = hour, y = dest_pred),
            color = "magenta", size = 1) + 
  geom_line(aes(x = hour, y = (dest_err)),
            color = "black", size = 1, linetype = "1111") +
  plotTheme() + 
  scale_x_continuous(breaks = c(0,6,12,18,24)) +
  coord_cartesian(xlim = c(1,23),ylim = c(-3,10)) +
  labs(title = "(error in black)",
       subtitle = "Supply (arrivals)",
       x = "Hour", y = ""),
nrow = 1)

Error across space

Lastly, the maps below indicate that stations in central Manhattan are likely to be predicted with significant error. These are also high-volume stations that see a lot of activity, so it might be expected that the error is higher as well.

space_err <- output %>%
  group_by(station_id) %>%
  summarize(orig_err = mean(abs(orig_pred - orig), na.rm = TRUE),
            dest_err = mean(abs(dest_pred - dest), na.rm = TRUE)) %>%
  left_join(all_stations %>% select(station_id,lon,lat))
gridExtra::grid.arrange(
ggplot() + 
  geom_sf(data = basemap_sf, fill = "black", color = NA) +
  geom_point(data = space_err %>% arrange(orig_err), 
             aes(color = orig_err, size = orig_err,
                 x = lon, y = lat)) + 
  coord_sf(xlim = c(min(nyc_stations$lon),max(nyc_stations$lon)),
           ylim = c(min(nyc_stations$lat),max(nyc_stations$lat))) +
  mapTheme() + 
  scale_color_gradientn(colors = c("gray20","red","red","red","red","red","red"),
                        limits = c(0,40)) +
  scale_size(range = c(0.5,3)) + 
  labs(title = "Mean absolute error by station",
       subtitle = "Demand (departures)") +
  theme(legend.position = "none"),
ggplot() + 
  geom_sf(data = basemap_sf, fill = "black", color = NA) +
  geom_point(data = space_err %>% arrange(dest_err), 
             aes(color = dest_err, size = dest_err,
                 x = lon, y = lat)) + 
  coord_sf(xlim = c(min(nyc_stations$lon),max(nyc_stations$lon)),
           ylim = c(min(nyc_stations$lat),max(nyc_stations$lat))) +
  mapTheme() + 
  scale_color_gradientn(colors = c("gray20","red","red","red","red","red","red"),
                        limits = c(0,40)) +
  scale_size(range = c(0.5,3)) + 
  labs(title = "",subtitle = "Supply (arrivals)") +
  theme(legend.position = "none"),
nrow = 1)


Next steps

Next steps to improve the model could include:

Building a model to predict the excess bike demand at each station (demand-supply), rather than deriving the predicted excess demand from two separate models predicting demand and supply. Directly predicting this variable may have the potential to provide more robust predictions.

Expanding the model to take into account the current availability of bikes and docks - when combined with predicted excess demand values, this will allow the model to meet the predictive needs of the app and convey information on how many bikes are projected to be available throughout the day.